Homework 1 - Maciej Paczóski¶

In [1]:
import pandas as pd
import numpy as np
import dalex
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
In [2]:
df = pd.read_csv("housing.csv")
df.describe()
Out[2]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

Data preprocessing¶

In [3]:
df = df.dropna()
In [4]:
df["ocean_proximity"].unique()
Out[4]:
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)
In [5]:
le = LabelEncoder()
df["ocean_proximity"] = le.fit_transform(df["ocean_proximity"])
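Note that LabelEncoder maps the five categories to arbitrary integers (0–4), which imposes an ordering the data does not have. Tree ensembles such as random forests usually tolerate this, but one-hot encoding avoids the assumption entirely. A minimal sketch with `pd.get_dummies` on a toy frame (the `op_` prefix is illustrative, not part of the assignment):

```python
import pandas as pd

# Toy frame with categories taken from the dataset.
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], prefix="op")
print(encoded.columns.tolist())
# ['op_INLAND', 'op_ISLAND', 'op_NEAR BAY']
```

For this homework the label encoding is kept, since the model is a random forest and the downstream dalex plots are easier to read with a single `ocean_proximity` column.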
In [6]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Model training¶

In [7]:
regr = RandomForestRegressor(n_estimators=5, random_state=0)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
MSE = metrics.mean_squared_error(y_test, y_pred)
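The MSE is computed but never reported, and since it is in squared dollars it is hard to interpret directly; its square root (RMSE) is in the same units as `median_house_value`. A quick sketch on toy values (not the homework's actual scores):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true and predicted house values, in dollars.
y_true = np.array([150_000.0, 200_000.0, 250_000.0])
y_hat = np.array([140_000.0, 210_000.0, 240_000.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # same units as the target (dollars)
r2 = r2_score(y_true, y_hat)
print(round(rmse, 2), round(r2, 4))  # 10000.0 0.94
```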

Break-down and Shapley plots¶

In [8]:
exp = dalex.Explainer(regr, X_test, y_test)
Preparation of a new explainer is initiated

  -> data              : 4087 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4087 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000028752051FC0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 4.41e+04, mean = 2.08e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -3.16e+05, mean = -6.51e+02, max = 3.19e+05
  -> model_info        : package sklearn

A new explainer has been created!
In [9]:
observation = X_test.iloc[[100, 200]]
observation_pred = regr.predict(observation)
order = X_test.columns.to_list()

First observation¶

In [10]:
exp.predict_parts(observation.iloc[[0]], type="break_down", order=order).plot()
exp.predict_parts(observation.iloc[[0]], type="shap").plot()

The break-down plot shows that longitude makes the largest positive contribution to the prediction, while latitude makes the largest negative one. These two variables likely interact, however: together they encode location, so a single coordinate is not meaningful on its own. The contribution of total_rooms flips from positive to negative between the two plots, which suggests it interacts with another variable.
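The sign flip between the two plots comes from order-dependence: break-down attributes a variable's effect at one fixed position in the ordering, while SHAP averages over orderings. This can be seen on a toy model with a pure interaction term (all names here are illustrative, not from the assignment):

```python
import numpy as np

# Toy model with an interaction: f(x1, x2) = x1 * x2.
def f(x1, x2):
    return x1 * x2

# Baseline values and the observation being explained.
baseline = np.array([0.0, 0.0])
obs = np.array([2.0, 3.0])

# Break-down contribution of x1 depends on whether x2 is already set:
# order (x1, x2): x1 contributes f(2, 0) - f(0, 0) = 0
contrib_x1_first = f(obs[0], baseline[1]) - f(baseline[0], baseline[1])
# order (x2, x1): x1 contributes f(2, 3) - f(0, 3) = 6
contrib_x1_last = f(obs[0], obs[1]) - f(baseline[0], obs[1])

# The Shapley value averages over both orderings: (0 + 6) / 2 = 3.
shap_x1 = (contrib_x1_first + contrib_x1_last) / 2
print(contrib_x1_first, contrib_x1_last, shap_x1)  # 0.0 6.0 3.0
```

When a feature's break-down contribution differs noticeably from its SHAP contribution, as with total_rooms here, that gap is itself evidence of an interaction.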

Second observation¶

In [11]:
exp.predict_parts(observation.iloc[[1]], type="break_down", order=order).plot()
exp.predict_parts(observation.iloc[[1]], type="shap").plot()

Most variables influence the prediction in the direction opposite to the first observation. The latitude and longitude contributions again suggest an interaction.